-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a NPE issue in GpuRand #11647
Fix a NPE issue in GpuRand #11647
Conversation
Signed-off-by: Firestarman <[email protected]>
b427eb9
to
c256b54
Compare
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really want to know what situation this was in where we missed the checkpoint/restore on retry.
Nondeterministic expressions in Spark have an initialize API
This is how Spark initializes the partition. Because of this the Exec nodes that non-deterministic expressions are allowed to run in are limited. This is so that the plan is guaranteed to not run the non-deterministic expression multiple times for a single row (like in the case of a join), and so that this function can be called properly.
Our goal is to have the checkpoint restore code in all places that these types of expressions could run. If we missed that I want to understand what happened so we can fix that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will just file a follow on issue to investigate this. Because we probably want to look at this in more depth.
I spoke with @jlowe and I think we really want to understand this better. #11649 The problem is that if a retry happens and it is not in a checkpoint/restore, then we will technically get data corruption. It is not 100% data corruption because it is a random, so we get a slightly different random number compared to Spark on the CPU, which is the only reason I am not blocking this from going in. But I really want to understand the situation where this happened. |
close #11646
curXORShiftRandomSeed
is marked astransient
, so it will be null on executors without retry-restore context, leading to this NPE.This fix removes the
transient
forcurXORShiftRandomSeed
,seed
andpreviousPartition
that will be used by the computation on executors.I verified it by the customer case, and it works well. The fix is simple so i don't add any tests.